
    A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition

    The goal of building a benchmark (a suite of datasets) is to provide a unified protocol for fair evaluation and thus accelerate progress in a specific area. Nonetheless, we point out that existing action recognition protocols can yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), covering a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained by both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observations suggest that current state-of-the-art models cannot reliably sustain high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark that yields insights for building next-generation spatiotemporal learners. Our dataset, code, and models are released at: https://github.com/AndongDeng/BEAR
    Comment: ICCV 202
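    The few-shot finetuning protocol mentioned above presumably involves sampling a small support set of labeled clips per class before tuning. A minimal sketch of such a per-class sampler (the function name and interface are illustrative, not taken from the BEAR codebase):

    ```python
    import random
    from collections import defaultdict

    def few_shot_split(labels, k, seed=0):
        """Return indices of (up to) k examples per class, as a support
        set for few-shot finetuning. `labels` is a flat list of class
        labels, one per video clip."""
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for idx, y in enumerate(labels):
            by_class[y].append(idx)
        support = []
        for _, idxs in sorted(by_class.items()):
            support.extend(rng.sample(idxs, min(k, len(idxs))))
        return sorted(support)
    ```

    Fixing the seed keeps the support set reproducible across the pre-trained models being compared, which matters for a fair evaluation.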

    Towards Efficient and Effective Representation Learning for Image and Video Understanding

    Deep learning has achieved tremendous success on various computer vision tasks. However, deep learning methods and models are usually computationally expensive, making them hard to train and deploy, especially on resource-constrained devices. In this dissertation, we explore how to improve the efficiency and effectiveness of deep learning methods from various perspectives. We first propose a new learning method to learn computationally adaptive representations. Traditional neural networks are static; in contrast, our method trains adaptive neural networks that can adjust their computational cost at runtime, avoiding the need to train and deploy multiple networks for dynamic resource budgets. Next, we extend our method to learn adaptive spatiotemporal representations for various video understanding tasks such as video recognition and action detection. Then, inspired by the proposed adaptive learning method, we propose a new regularization method to learn better representations for the full network. Our method regularizes the full network by ensuring that its predictions align with those of its sub-networks when fed differently transformed input data. This approach facilitates the learning of more generalized and robust representations by the full network. Besides learning methods, designing good network architectures is also critical for learning good representations. Neural architecture search (NAS) has shown great potential in designing novel network structures, but its high computational cost is a significant limitation. To address this issue, we present a new short-training-based NAS method that achieves superior performance compared to previous methods while requiring significantly less search cost. Finally, motivated by recent advancements in large-scale image foundation models, we present an efficient finetuning method to adapt pre-trained image foundation models for video understanding. Our method significantly reduces training costs compared to traditional full fine-tuning, while delivering competitive performance across multiple video benchmarks. It is both simple and versatile, making it easy to leverage stronger image foundation models in the future.
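    The sub-network regularization described above can be read as a prediction-consistency penalty: the full network's output distribution on one view of the input is pushed toward a sub-network's output on another view. A minimal sketch of such a penalty as a KL divergence between softmaxed logits (the exact loss and its direction are assumptions, not the dissertation's formulation):

    ```python
    import math

    def softmax(logits):
        """Numerically stable softmax over a list of logits."""
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        return [e / s for e in exps]

    def consistency_loss(full_logits, sub_logits):
        """KL(p_sub || p_full): penalize disagreement between the full
        network's predictions and a sub-network's predictions, each
        computed on a differently transformed view of the same input."""
        p = softmax(sub_logits)
        q = softmax(full_logits)
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    ```

    The penalty is zero when the two prediction distributions agree exactly and grows as they diverge, so adding it to the task loss encourages the full network and its sub-networks to produce mutually consistent predictions.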

    Exploring Parameter-Efficient Fine-tuning for Improving Communication Efficiency in Federated Learning

    Federated learning (FL) has emerged as a promising paradigm for collaboratively training models without centralized access to the raw data on local devices. In the typical FL paradigm (e.g., FedAvg), model weights are sent between the server and participating clients each round. However, this can quickly place a massive communication burden on the system, especially if more capable models beyond very small MLPs are employed. Recently, the use of pre-trained models has been shown to be effective in federated learning optimization and in improving convergence. This opens the door to new research questions. Can we adjust the weight-sharing paradigm in federated learning, leveraging strong and readily available pre-trained models, to significantly reduce the communication burden while simultaneously achieving excellent performance? To this end, we investigate the use of parameter-efficient fine-tuning in federated learning. Specifically, we systematically evaluate the performance of several parameter-efficient fine-tuning methods across a variety of client stability, data distribution, and differential privacy settings. By locally tuning and globally sharing only a small portion of the model weights, significant reductions in total communication overhead can be achieved while maintaining competitive performance across a wide range of federated learning scenarios, providing insight into a new paradigm for practical and effective federated systems.
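    The weight-sharing adjustment described above amounts to running FedAvg over only the small trainable subset (e.g., adapter or LoRA weights) while the frozen pre-trained backbone never leaves the device. A minimal sketch of that server-side aggregation step, with scalars standing in for weight tensors (parameter names are illustrative):

    ```python
    def fedavg_peft(client_updates, weights=None):
        """Aggregate only the parameter-efficient (trainable) weights
        uploaded by clients. `client_updates` is a list of dicts mapping
        parameter names to values; the pre-trained backbone is frozen and
        shared ahead of time, so it is never communicated per round."""
        n = len(client_updates)
        if weights is None:
            weights = [1.0 / n] * n  # uniform client weighting
        keys = client_updates[0].keys()
        return {
            k: sum(w * upd[k] for w, upd in zip(weights, client_updates))
            for k in keys
        }
    ```

    Because each round transmits only the tuned subset, the per-round communication cost scales with the adapter size rather than the full model size, which is the source of the communication savings the abstract reports.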